17 February 2017, University of York

me

Macroecologist

Quality assurance

Stock control

workshop motivation

why Research Data Management?

Technological progress in Science

Technology is increasing data collection throughput

  • data are more complex and high-dimensional
  • existing databases can be merged to become bigger databases

Computing power allows more sophisticated analyses

  • even on "small" data
  • for every field "X" there is a "Computational X"

powerful statistical tools freely available

  • active open source communities developing data science tools
  • built on a foundation of key digital literacy skills

the internet

Information overload


replicability crisis

attempts to reproduce the published results of many scientific experiments fail on subsequent investigation, whether by independent researchers or by the original researchers themselves.

The big question:

How much can we trust published results?


  • convergence of a number of drivers:
    • some inherent to the scientific process
    • some statistical, resulting from biases introduced by the publication system
    • some caused by methodological failures and errors
      • there is some very low-hanging fruit here

at the very least we need to aim towards reproducing results from code and data

the pitfalls of #OtherPeoplesData

culture wars

the core of reproducibility

  • provide access to code and data
  • allows published results to be checked

broader open science

focus on reproducibility underpinned by more fundamental qualities:

  • openness
  • transparency
  • accessibility
  • reusability
  • collaboration

#OpenScience

stifling science?

critics say it's constraining: time wasted on outputs for which researchers won't be rewarded.

But this ignores:

  • the value of code and data to other researchers
  • the impact such research outputs can have
  • the technological capacity for producing these outputs through standard, openly available tools.

clearly a balance is needed

but higher standards are both possible and imperative

A more collaborative approach.

origins in open source software communities. see SCIENCE X PYTHON

There are both selfish and selfless motivations. But the incentive system needs to change.

Contributor benefits

  • broader impact of research outputs like code and data
  • more opportunity for recognition for diverse but useful outputs
  • skills are very transferable and employable

Community benefits

  • innovation
  • efficient reuse of resources
  • capacity building
  • quality control

Drivers of better data management

  • Funders: value for money, impact, reputation
    • NERC data policy
  • Publishers: many now require code and data.
  • Supervisors and immediate research group
  • Your wider scientific community

Yourselves!

be your own best friend:

aim to create secure datasets that are easy to use and REUSE

so…where do you come in?

RDM at postgraduate level

Plan your RDM

  • Start early. Make an RDM plan before collecting data.

  • Anticipate data products as part of your thesis outputs
  • Think about what technologies to use

Take the initiative. You are now responsible.


BES guide to data management


This guide for early career researchers explains what data and data management are, and provides advice and examples of best practices in data management, including case studies from researchers currently working in ecology and evolution.


download

Nine simple ways to make it easier to (re)use your data


We describe nine simple ways to make it easy to reuse the data that you share and also make it easier to work with it yourself. Our recommendations focus on making your data understandable, easy to analyze, and readily available to the wider community of scientists.


download

data carpentry

Data collection

Before collection

  • if you're going to record data manually, produce data entry sheets or tables. Design them with both ease of entry and digitisation in mind.
  • if you are doing fieldwork consider using a pencil!
  • think in good time about variable definitions and naming
  • think about what other ancillary data might be useful. Include it in your collection plan


record…everything!

Even details that seem insignificant may prove vital to your research, or to interoperability with others' work.

Data entry

an extreme, but in many ways defensible, approach

excel: read only

spreadsheet advice

databases: more robust

  • provides good quality control and is advisable when there are multiple contributors

databases: benefits




data formats

  • .csv files are the most common
  • text files are the most general file format
  • more unusual formats will need instructions on use.

ensure data is machine readable

(spreadsheet layout examples: bad, bad, good, ok)

The "ok" layout:

  • could help data entry
  • but a .csv copy would need to be saved.

file naming

  • Give files descriptive, consistent names, e.g. “AK_IUCNDataExtraction_20131002.csv”
    • include any pertinent parameter information
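Descriptive names like the example above can also be generated programmatically, so they stay consistent across files; this sketch just rebuilds that slide example with today's date (the initials and dataset label are the slide's example values):

```r
# build a descriptive, consistent file name from its parts
# (initials and dataset label follow the example above)
fname <- paste0("AK_IUCNDataExtraction_", format(Sys.Date(), "%Y%m%d"), ".csv")
fname
```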

machine writeable = machine readable

fname <- paste0(paste("variable", res, month, sep = "_"), ".csv")
write.csv(df, fname, row.names = FALSE)

df <- read.csv(fname)

Use good null values

Missing values are a fact of life

  • Usually, best solution is to leave blank
  • NA or NULL are also good options
  • NEVER use 0. Avoid numbers like -999
  • Don’t make up your own code for missing values
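A quick illustration of why 0 or -999 make bad missing-value codes: they are valid numbers, so they silently distort statistics, whereas NA is handled explicitly:

```r
mass <- c(1.2, 3.4, -999)   # -999 intended to mean "missing"
mean(mass)                  # badly wrong: -999 is treated as data

mass[mass == -999] <- NA    # recode to a proper missing value
mean(mass, na.rm = TRUE)    # correct: 2.3
```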

read.csv() utilities

  • na.strings: a character vector of strings to be treated as missing values and replaced with NA
  • strip.white: logical; if TRUE, strips leading and trailing white space from unquoted character fields
  • blank.lines.skip: logical; if TRUE, blank lines in the input are ignored
  • fileEncoding: if you're getting funny characters, you probably need to specify the correct encoding
read.csv(file, na.strings = c("NA", "-999"), strip.white = TRUE,
         blank.lines.skip = TRUE, fileEncoding = "latin1")

basic quality control

Have a look at your data with View(df)



  • Check empty cells
  • Check the range of values (and value types) in each column matches expectation. Use summary(df)
  • Check units of measurement
  • Check your software interprets your data correctly, e.g. for a data frame df, head(df) and str(df) are useful
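The checks above can be sketched in a few lines of R, here on a small made-up data frame where a stray -999 was used instead of NA:

```r
df <- data.frame(species = c("a", "b", NA),
                 mass_g  = c(1.2, -999, 3.4))  # -999 should have been NA

colSums(is.na(df))  # empty (missing) cells per column
summary(df)         # value ranges: the -999 stands out immediately
str(df)             # check column types match expectation
head(df)            # eyeball the first rows
```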

make your data alignable and generalisable

What information would other users require to combine your data with theirs?

  • time: temporal information (time of day, day, month, year, season)
  • space: geographic information (lat, lon)
  • taxonomy: species name; authority / source
  • provide information on extent and resolution
  • consider writing some simple QA tests (eg. checks against number of dimensions, sum of numeric columns etc)
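Simple QA tests like those suggested in the last bullet can be written with stopifnot(); the column names and expected totals here are hypothetical examples:

```r
df <- data.frame(lat   = c(54.0, 53.9),
                 lon   = c(-1.1, -1.0),
                 count = c(10, 25))

stopifnot(ncol(df) == 3)                        # expected dimensions
stopifnot(all(df$lat >= -90  & df$lat <= 90))   # valid latitude range
stopifnot(all(df$lon >= -180 & df$lon <= 180))  # valid longitude range
stopifnot(sum(df$count) == 35)                  # column sum matches entry records
```

If any check fails, stopifnot() throws an error naming the failed condition, so broken data never reaches the analysis silently.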

data management

raw data are sacrosanct

from raw to analytical data

the repeatable pipeline

Do not manually edit raw data

Keep a clean pipeline of data processing from raw to analytical.

  • Ideally, incorporate checks to ensure correct processing of data through to analytical.
  • Clearly document any processing.
    • Well-commented script-based manipulation (e.g. in R) makes this easy
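A minimal sketch of such a pipeline in R; the file names and columns are hypothetical, and a temporary directory stands in for your project folders so the example is self-contained:

```r
raw_file <- file.path(tempdir(), "counts_raw.csv")
write.csv(data.frame(site = c("A", "B", "C"), count = c(10, NA, 25)),
          raw_file, row.names = FALSE)  # stand-in for your real raw file

# the raw file is read, never edited in place
raw <- read.csv(raw_file, strip.white = TRUE)

# documented processing step: drop records with missing counts
analytical <- raw[!is.na(raw$count), ]

# check that processing behaved as expected before saving
stopifnot(nrow(analytical) == 2, all(!is.na(analytical$count)))

write.csv(analytical, file.path(tempdir(), "counts_clean.csv"),
          row.names = FALSE)
```

Rerunning the script reproduces the analytical file from raw exactly, which is the point of a repeatable pipeline.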

Know your masters

  • identify the master copy of files
  • keep it safe and accessible
  • consider version control
  • consider centralising


avoid catastrophe

Backup on disk

Backup in the cloud

  • dropbox, googledrive etc.
  • if installed on your system, they can be accessed programmatically through R
  • some version control

The Open Science Framework osf.io

  • version controlled
  • easily shareable
  • works with other apps
  • work on an R interface (osfr) is in progress